YouTube (the world-famous video sharing website) maintains a list of the top trending videos on the platform. According to Variety magazine, “To determine the year’s top-trending videos, YouTube uses a combination of factors including measuring users interactions (number of views, shares, comments, and likes). Note that they’re not the most-viewed videos overall for the calendar year”. Top performers on the YouTube trending list are music videos (such as the famously viral “Gangnam Style”), celebrity and reality TV performances, and the random dude-with-a-camera viral videos that YouTube is well known for.
The given data set covers 5 different regions. This report works with the US dataset, which contains a daily record of the top trending YouTube videos from November 2017 to June 2018.
Here we will mainly focus on which categories produce the most top trending videos, how those videos are received by users (based on likes, dislikes, and comments), and whether trending videos are associated with any big channels.
# Let's import all the packages we need.
import pandas as pd
import numpy as np
import json
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
%matplotlib inline
warnings.filterwarnings('ignore')
Get Data
#load the csv file
us_videos = pd.read_csv('USvideos.csv')
#get the data from the json file
#create an empty dictionary to map category ids to names
category_dict = {}
#open the file
with open("US_category_id.json") as file:
    categories = json.load(file)["items"]  # the "items" key holds each category's id and name
for category in categories:  # loop over the categories, storing id -> title in the dictionary
    category_dict[int(category["id"])] = category["snippet"]["title"]
#saving the extracted category name to the original dataframe
us_videos['category_name'] = us_videos['category_id'].map(category_dict)
#view the files
us_videos.info() #dtypes and number of records
us_videos.head() #glimpse of data
Let's check for quality and tidiness issues in the dataset.
#lets check how many unique values we have in the dataset
us_videos.nunique()
#lets see if we have duplicates in dataset and also duplicates based on video_id
sum(us_videos.duplicated()), sum(us_videos.video_id.duplicated())
#lets view the duplicates in complete dataset
us_videos[us_videos.duplicated(keep = False)]
#lets check one id
us_videos[us_videos.video_id == 'QBL8IRJ5yHU']
#lets check if the data has any null values
us_videos.isnull().sum()
us_videos[us_videos.comments_disabled == True]
48 duplicate records in the dataset
34,598 duplicates based on video_id
null values in description (can be ignored)
drop unwanted columns
publish_time should be split into year, month, day, and hour columns
#before we start cleaning, let's take a copy of the dataset
df_videos = us_videos.copy()
df_videos.head()
48 duplicate records in dataset
Define:
We need to drop the duplicates in the dataset which can be done using drop_duplicates in pandas.
Code
#dropping the duplicates
df_videos.drop_duplicates(inplace = True)
Test
#check that the duplicates are dropped
df_videos.duplicated().sum()
34,598 duplicates based on video_id
Define:
The multiple records per video_id exist because the same video can stay trending for multiple days. So we cannot simply drop all duplicates based on video_id; instead we keep only the last record for each video, which carries the updated counts (likes, dislikes, and comments) accumulated over the previous days.
Code
# lets drop the records by keeping video_id as reference
# also making sure we keep the last record in the duplicates
df_videos.drop_duplicates(subset = 'video_id', keep = 'last', inplace = True)
Test
#let's check that the duplicates are gone
df_videos.video_id.duplicated().sum()
#Also lets check how many records we have now in the dataset
df_videos.info()
null values in description
Define:
The description only tells us about what the video is about. So we can ignore the null values.
drop unwanted columns
Define:
There are a few columns in the dataset that we will not be using for further analysis in this report. We will drop the following columns: category_id, trending_date, tags, thumbnail_link, comments_disabled, ratings_disabled, video_error_or_removed, and description.
Code
#lets now drop the unwanted columns
df_videos.drop(['category_id', 'trending_date','tags', 'thumbnail_link', 'comments_disabled',
'ratings_disabled', 'video_error_or_removed','description'], axis = 1, inplace = True)
Test
#lets check the records we have now
df_videos.info()
publish_time should be split into year, month, day, and hour columns
Define:
As we have already dropped trending_date, we only need to change the publish_time datatype to datetime and extract the year, month, day, and hour using the dt accessor in pandas.
Code
#lets first change publish_time to a datetime datatype
df_videos.publish_time = pd.to_datetime(df_videos.publish_time)
# lets get year and month from the publish_time
df_videos['year'] = df_videos['publish_time'].dt.year
df_videos['month'] = df_videos['publish_time'].dt.month_name()
df_videos['month_num'] = df_videos['publish_time'].dt.month
#Let us also get weekday name and hour from the publish_time
df_videos['day'] = df_videos['publish_time'].dt.day_name()
df_videos['hour'] = df_videos['publish_time'].dt.hour
# Now lets drop the publish_time column
df_videos.drop(['publish_time'], axis = 1, inplace = True)
Test
#Lets check our dataset
df_videos.info()
df_videos.head()
#hour is an int; change it to object for better plotting
df_videos['hour'] = df_videos['hour'].astype('object')
df_videos['month_num'] = df_videos['month_num'].astype('object')
#check the data types
df_videos.dtypes
Let's go ahead and analyze our cleaned data and see what we can learn from it. We will look at both the quantitative and qualitative variables and their spread, exploring the data through univariate, bivariate, and multivariate analysis. We will also derive new variables from the ones already present to explore and analyze the data further.
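As one example of deriving new variables from the existing counts, we could compute per-view engagement rates. A minimal sketch on a made-up mini-frame (the `like_rate` and `comment_rate` names are ours, not columns in the dataset):

```python
import pandas as pd

# Hypothetical mini-frame standing in for df_videos
# (views, likes, comment_count mirror the real columns).
sample = pd.DataFrame({
    'views': [1000, 5000, 20000],
    'likes': [100, 250, 4000],
    'comment_count': [10, 50, 800],
})

# Engagement rates: interactions per view.
sample['like_rate'] = sample['likes'] / sample['views']
sample['comment_rate'] = sample['comment_count'] / sample['views']
print(sample)
```

The same two lines would work on df_videos directly, since the division is vectorized over the whole column.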
#lets check the statistics of quantitative variables in the data
df_videos.describe()
Univariate Exploration of Data
Univariate exploration helps us learn about individual variables, both quantitative and qualitative. We will analyze views, likes, dislikes, comment counts, publish details, the categories of published videos, and the channels that published them, each individually.
# function for histogram plots
def hist_plot(data, x, bin_edges, title, figsize, typee):
    if typee == 'normal':
        plt.figure(figsize = [figsize[0], figsize[1]])
        # configure bins: bin_edges is a list-like of [min, max, interval]
        bin_edges = np.arange(bin_edges[0], bin_edges[1] + bin_edges[2], bin_edges[2])
        # plot
        plt.hist(data=data, x=x, bins = bin_edges, rwidth = 0.8);
        # set axis labels and title
        plt.xlabel(x.upper())  # column name in uppercase
        plt.ylabel('COUNT')
        plt.title(title)
    elif typee == 'lim':
        plt.figure(figsize = [figsize[0], figsize[1]])
        # set axis limits
        plt.xlim(bin_edges[0], bin_edges[1])
        # configure bins: bin_edges is a list-like of [min, max, interval]
        bin_edges = np.arange(bin_edges[0], bin_edges[1] + bin_edges[2], bin_edges[2])
        # plot
        plt.hist(data=data, x=x, bins = bin_edges, rwidth = 0.8);
        # set axis labels and title
        plt.xlabel(x.upper())  # column name in uppercase
        plt.ylabel('COUNT')
        plt.title(title)
    elif typee == 'log':
        plt.figure(figsize = [figsize[0], figsize[1]])
        # configure bins on a log grid: bin_edges is [min exponent, max exponent, step]
        bin_edges = 10 ** np.arange(bin_edges[0], bin_edges[1] + bin_edges[2], bin_edges[2])
        # plot
        plt.hist(data=data, x=x, bins = bin_edges);
        # set a log scale on the x axis
        plt.xscale('log');
        # set axis labels and title
        plt.xlabel(x.upper())  # column name in uppercase
        plt.ylabel('COUNT')
        plt.title(title)
    else:
        print('Please check typee')
# Before plotting the data lets see what values we have in 'views'
df_videos.views.sort_values(ascending = False)
#lets plot the views data
# The minimum value is 559, which sets the lower bin limit; the maximum view count sets the upper limit
hist_plot(df_videos, 'views', [559, df_videos.views.max(), 5000000], 'Distribution of Views', [12,5], 'normal')
Comment:
The above plot doesn't show much detail about how the views are spread. Let's set axis limits and check the spread of the data.
# lets check the statistics of views data
df_videos.views.describe()
As mentioned earlier, we will now set axis limits to get a better view of the spread of the data. We will draw three plots covering three different ranges of the data.
# first histogram: focus in on bulk of data < 2000000
hist_plot(df_videos, 'views', [0, 2000000, 40000],
'Distribution of views: focus in on bulk of data < 2000000', [15,5], 'lim')
# second histogram: focus in on bulk of data > 2000000 and < 20000000
hist_plot(df_videos, 'views', [2000000, 20000000, 500000],
'Distribution of views: focus in on bulk of data > 2000000 and < 20000000', [15,5], 'lim')
# third histogram: focus in on bulk of data > 20000000
hist_plot(df_videos, 'views', [20000000, 250000000, 8000000],
'Distribution of views: focus in on bulk of data > 20000000', [15,5], 'lim')
Comment:
Breaking the data into three parts gives a better view of the spread, but we still don't have a clear picture of the distribution of views. We do now know that most videos have between 10K and 10M views. To get a clearer picture, let's apply a log transformation and see if the distribution becomes easier to read.
# Get the statistics of data by applying log
np.log10(df_videos.views.describe())
#log plot
hist_plot(df_videos, 'views', [2.5, 9, 0.1],
'Distribution of views over log scale', [15,8], 'log')
Comment:
From the above plot we get a much better view of the distribution of the 'views' data. The raw data was right skewed, but on a log scale the distribution is approximately normal.
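The claim that a log transform tames right skew can also be checked numerically. A rough sketch on synthetic lognormal data (not the actual views column; the parameters are made up):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed, 'views'-like data.
views = pd.Series(rng.lognormal(mean=12, sigma=2, size=5000))

skew_raw = views.skew()            # strongly right skewed
skew_log = np.log10(views).skew()  # roughly symmetric after the transform
print(skew_raw, skew_log)
```

Running the same two `skew()` calls on `df_videos.views` would quantify what the histograms show visually.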
# Before plotting the data lets see what values we have in 'likes'
df_videos.likes.sort_values(ascending = False)
#set the bin limits according to the values of 'likes' in the data
hist_plot(df_videos, 'likes', [0, df_videos.likes.max(), 1000000],
'Distribution of Likes', [8,5], 'normal')
Comment:
The above plot doesn't show much detail about how the likes are spread. Let's set axis limits and check the spread of the data.
# first histogram: focus in on bulk of data < 200000
hist_plot(df_videos, 'likes', [0,200000, 4000],
'Distribution of Likes: focus in on bulk of data < 200000', [15,5], 'lim')
# second histogram: focus in on bulk of data > 200000 and < 2000000
hist_plot(df_videos, 'likes', [200000,2000000, 50000],
'Distribution of Likes: focus in on bulk of data > 200000 and < 2000000', [15,5], 'lim')
# third histogram: focus in on bulk of data > 2000000
hist_plot(df_videos, 'likes', [2000000, 6000000, 100000],
'Distribution of Likes: focus in on bulk of data > 2000000', [15,5], 'lim')
Comment:
Breaking the data into three parts gives a better view of the spread, but we still don't have a clear picture of the distribution of likes. We do now know that most videos have between 0 and 1M likes. To get a clearer picture, let's apply a log transformation and see if the distribution of likes becomes easier to read.
# Get the statistics of data by applying log
np.log10(df_videos.likes.describe())
#plot data
hist_plot(df_videos, 'likes', [0, 7, 0.1],
'Distribution of likes over log scale', [15,8], 'log')
Comment:
From the above plot we get a much better view of the distribution of the 'likes' data. The raw data was right skewed, but on a log scale the distribution is approximately normal.
# Before plotting the data lets see what values we have in 'dislikes'
df_videos.dislikes.sort_values(ascending = False)
#plot data
hist_plot(df_videos, 'dislikes', [0, df_videos['dislikes'].max(), 100000],
'Distribution of Dislikes', [8,5], 'normal')
Comment:
The above plot doesn't show much detail about how the dislikes are spread. Let's set axis limits and check the spread of the data.
#Lets check the statistics of dislikes
df_videos.dislikes.describe()
# first histogram: focus in on bulk of data < 20000
hist_plot(df_videos, 'dislikes', [0, 20000, 400],
'Distribution of dislikes: focus in on bulk of data < 20000', [15,5], 'lim')
# second histogram: focus in on bulk of data > 20000 and < 200000
hist_plot(df_videos, 'dislikes', [20000, 200000, 5000],
'Distribution of dislikes: focus in on bulk of data > 20000 and < 200000', [15,5], 'lim')
# third histogram: focus in on bulk of data > 200000
hist_plot(df_videos, 'dislikes', [200000, 2000000, 80000],
'Distribution of dislikes: focus in on bulk of data > 200000', [15,5], 'lim')
Comment:
Breaking the data into three parts gives a better view of the spread, but we still don't have a clear picture of the distribution of dislikes. We do now know that most videos have between 0 and 180K dislikes. To get a clearer picture, let's apply a log transformation and see if the distribution of dislikes becomes easier to read.
# Get the statistics of data by applying log
np.log10(df_videos.dislikes.describe())
#plot data
hist_plot(df_videos, 'dislikes', [0, 7, 0.1],
'Distribution of dislikes over log scale', [15,8], 'log')
Comment:
From the above plot we get a much better view of the distribution of the 'dislikes' data. The raw data was right skewed, but on a log scale the distribution is approximately normal, with some outliers.
# Before plotting the data lets see what values we have in 'comment_count'
df_videos.comment_count.sort_values(ascending = False)
#plot data
hist_plot(df_videos, 'comment_count', [0, df_videos['comment_count'].max(), 70000],
'Distribution of Comment Count', [8,5], 'normal')
Comment:
The above plot doesn't show much detail about how the comment counts are spread. Let's set axis limits and check the spread of the data.
# first histogram: focus in on bulk of data < 20000
hist_plot(df_videos, 'comment_count', [0, 20000, 400],
'Distribution of comments count: focus in on bulk of data < 20000', [15,5], 'lim')
# second histogram: focus in on bulk of data > 20000 and < 200000
hist_plot(df_videos, 'comment_count', [20000, 200000, 5000],
'Distribution of comments count: focus in on bulk of data > 20000 and < 200000', [15,5], 'lim')
# third histogram: focus in on bulk of data > 200000
hist_plot(df_videos, 'comment_count', [200000, 2000000, 50000],
'Distribution of comments count: focus in on bulk of data > 200000', [15,5], 'lim')
Comment
Breaking the data into three parts gives a better view of the spread, but we still don't have a clear picture of the distribution of comment counts. We do now know that most videos have between 0 and 200K comments. To get a clearer picture, let's apply a log transformation and see if the distribution of comment counts becomes easier to read.
# Get the statistics of data by applying log
np.log10(df_videos.comment_count.describe())
#plot data
hist_plot(df_videos, 'comment_count', [0, 7, 0.1],
'Distribution of comments count over log scale', [15,8], 'log')
Comment:
From the above plot we get a much better view of the distribution of the 'comment_count' data. The raw data was right skewed, but on a log scale the distribution is approximately normal.
Let's now check the qualitative variables of the data: category, channel, month, day, and hour.
First, let's write a function for count plots.
base_color = sns.color_palette()[0]

def count_plot(figsize, data, x, angle, title, typee):
    if typee == 'vertical':
        #figsize
        plt.figure(figsize = [figsize[0], figsize[1]])
        #order categories by number of videos published
        val = data[x].value_counts()
        val_order = val.index
        #plot the data
        sns.countplot(x = x, data = data, order = val_order, color = base_color);
        #axis labels and title
        plt.xticks(rotation= angle);
        plt.xlabel(x.upper())  # column name in uppercase
        plt.ylabel('COUNT')
        plt.title(title)
    elif typee == 'horizontal':
        #figsize
        plt.figure(figsize = [figsize[0], figsize[1]])
        #order by number of videos published; keep the top 20
        val = data[x].value_counts()[:20]
        val_order = val.index
        #plot the data
        sns.countplot(y = x, data = data, order = val_order, color = base_color);
        plt.xticks(rotation= angle);
        plt.ylabel(x.upper())  # column name in uppercase
        plt.xlabel('COUNT')
        plt.title(title)
First, let's plot the categories of trending videos.
#plot
count_plot([15,6], df_videos, 'category_name', 90, 'Most trending videos based on Categories', 'vertical')
Comment:
From the above plot we can see that the Entertainment category has the most trending videos, followed by Music and Howto & Style, while the fewest come from Nonprofits & Activism and Shows. Entertainment accounts for more than 1,600 trending videos, Music for about 800, and Howto & Style for about 600.
Let's now check which channel has published the most trending videos.
#plot
count_plot([12,8], df_videos, 'channel_title', 0, 'Most trending videos based from channel title', 'horizontal')
Comment:
From the above plot we can see that ESPN and The Ellen Show have the most trending videos: ESPN with more than 80 and The Ellen Show with more than 70.
Let's now see which month has the most published trending videos.
#plot
count_plot([12,6], df_videos, 'month', 0, 'Most trending videos based on months', 'vertical')
Comment:
From the above plot we can see that most trending videos were published in January and the fewest in July.
Let's now see which day has the most published trending videos.
#plot
count_plot([12,6], df_videos, 'day', 0, 'Most trending videos based on day', 'vertical')
Comment:
From the above plot we can see that most trending videos were published on Wednesday and the fewest on Saturday.
Let's now see which hour of the day has the most published trending videos.
#plot
count_plot([12,6], df_videos, 'hour', 0, 'Most trending videos based on hour of the day', 'vertical')
Comment:
From the above plot we can see that most trending videos were published at 4 PM and the fewest at 9 AM.
Bivariate Exploration of Data
Bivariate exploration analyzes two variables at a time, whether two quantitative variables or one quantitative and one qualitative variable. It helps us see how two variables relate to each other.
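Before the scatter plots, the strength of such relationships can also be summarized with Pearson correlation coefficients via pandas `.corr()`. A sketch on synthetic data (the columns and constants below are made up, not the real dataset):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
views = rng.lognormal(10, 1, 1000)
# likes roughly proportional to views, with multiplicative noise
likes = views * 0.04 * rng.lognormal(0, 0.3, 1000)
# an unrelated series for contrast
noise = rng.lognormal(5, 1, 1000)

df = pd.DataFrame({'views': views, 'likes': likes, 'noise': noise})

# Pearson correlation on log-transformed values, matching the log plots used below.
corr = np.log10(df).corr()
print(corr.round(2))
```

On the real data, `np.log10(df_videos[['views', 'likes', 'dislikes', 'comment_count']]).corr()` would give the same kind of summary in one table.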
Let's write some functions for scatter plots.
def scat_plot(figsize, data, x, y, alpha, title, typee):
    if typee == 'normal':
        #figsize
        plt.figure(figsize = (figsize[0], figsize[1]))
        #plot the data
        sns.regplot(y= y, x= x, data = data, scatter_kws = {'alpha': alpha})
        #set axis labels and title
        plt.ylabel(y.upper())
        plt.xlabel(x.upper())
        plt.title(title);
    elif typee == 'log_data':
        #figsize
        plt.figure(figsize = (figsize[0], figsize[1]))
        #log-transform the data before plotting
        x1= np.log(data[x])
        y1= np.log(data[y])
        sns.regplot(y= y1, x= x1, data = data, scatter_kws = {'alpha': alpha})
        #set axis labels and title
        plt.ylabel(y.upper())
        plt.xlabel(x.upper())
        plt.title(title);
    elif typee == 'log_scale':
        #figsize
        plt.figure(figsize = (figsize[0], figsize[1]))
        #plot the data
        sns.regplot(y= y, x= x, data = data, scatter_kws = {'alpha': alpha})
        #view the data on a log scale
        plt.xscale('log');
        plt.yscale('log');
        #set axis labels and title
        plt.ylabel(y.upper())
        plt.xlabel(x.upper())
        plt.title(title);
    else:
        print('Please check typee')
Let us first see if we can find any correlation between likes and views.
#plot
scat_plot([12,9], df_videos, 'views', 'likes', 1/5 , 'Distribution of Views vs Distribution of Likes', 'normal')
Comment:
The above plot shows the relation between views and likes of trending videos, but it suffers from overplotting and the scale makes the distribution hard to read. Let's apply a log scale and see if we get a better understanding of the data.
#plot
scat_plot([12,8], df_videos, 'views', 'likes', 1/5 ,
'Distribution of Views vs Distribution of Likes (log scale)', 'log_scale')
# lets check the plot in log data and not in log scale
scat_plot([12,8], df_videos, 'views', 'likes', 1/5 ,
'Distribution of Views vs Distribution of Likes (log data)', 'log_data')
Comment:
With the log scale and log-transformed data we can see that the correlation between views and likes is positive and strong.
Let's now check the correlation between views and dislikes.
#plot
scat_plot([12,9], df_videos, 'views', 'dislikes', 1/3 ,
'Distribution of Views vs Distribution of Dislikes', 'normal')
Comment:
The above plot shows the relation between views and dislikes of trending videos, but it suffers from overplotting and the scale makes the distribution hard to read. Let's apply a log scale and see if we get a better understanding of the data.
#plot
scat_plot([15,7], df_videos, 'views', 'dislikes', 1/5 ,
'Distribution of Views vs Distribution of Dislikes(log scale)', 'log_scale')
# lets check the plot in log data and not in log scale
scat_plot([15,7], df_videos, 'views', 'dislikes', 1/5 ,
'Distribution of Views vs Distribution of Dislikes(log data)', 'log_data')
Comment:
With the log scale and log-transformed data we can see that the correlation between views and dislikes is also positive, but not as strong as the correlation between views and likes.
Let's now check the correlation between views and comment count.
#plot
scat_plot([12,9], df_videos, 'views', 'comment_count', 1/5 ,
'Distribution of Views vs Distribution of Comment Count', 'normal')
Comment:
The above plot shows the relation between views and comment_count of trending videos, but the large scale and overplotting make the distribution hard to read. Let's apply a log scale and log-transformed data and see if we get a better understanding.
#plot
scat_plot([12,9], df_videos, 'views', 'comment_count', 1/3 ,
'Distribution of Views vs Distribution of Comment count(log scale)', 'log_scale')
# lets check the plot in log data and not in log scale
scat_plot([12,9], df_videos, 'views', 'comment_count', 1/5 ,
'Distribution of Views vs Distribution of Comment count(log data)', 'log_data')
Comment:
With the log scale and log-transformed data we can see that the correlation between views and comment count is also positive, but weak.
Let's now check the correlation between likes and dislikes.
#plot
scat_plot([12,9], df_videos, 'likes', 'dislikes', 1/5 ,
'Distribution of Likes vs Distribution of Dislikes', 'normal')
Comment:
The above plot shows a positive relation between likes and dislikes of trending videos, but the large scale makes the distribution hard to read. Let's apply a log scale and log-transformed data and see if we get a better understanding.
#plot
scat_plot([12,9], df_videos, 'likes', 'dislikes', 1/5 ,
'Distribution of Likes vs Distribution of Dislikes(log scale)', 'log_scale')
#plot
scat_plot([12,9], df_videos, 'likes', 'dislikes', 1/5 ,
'Distribution of Likes vs Distribution of Dislikes(log data)', 'log_data')
Comment:
With the log scale and log-transformed data we can see that the relation between likes and dislikes is positive but not strong.
Let's check the relation between likes and comments.
#plot
scat_plot([12,9], df_videos, 'likes', 'comment_count', 1/5 ,
'Distribution of Likes vs Distribution of comment count', 'normal')
Comment:
The above plot shows a positive relation between likes and comment count of trending videos, but the large scale and overplotting make the distribution hard to read. Let's apply a log scale and log-transformed data and see if we get a better understanding.
#plot
scat_plot([12,9], df_videos, 'likes', 'comment_count', 1/5 ,
'Distribution of Likes vs Distribution of comment count(log scale)', 'log_scale')
#plot
scat_plot([12,9], df_videos, 'likes', 'comment_count', 1/5 ,
'Distribution of Likes vs Distribution of comment count(log data)', 'log_data')
Comment:
With the log scale and log-transformed data we can see that the relation between likes and comment count is positive.
Let's check the relation between dislikes and comments.
#plot
scat_plot([12,9], df_videos, 'dislikes', 'comment_count', 1/5 ,
'Distribution of Dislikes vs Distribution of comment count', 'normal')
Comment:
The above plot shows a positive relation between dislikes and comment count of trending videos, but the large scale and overplotting make the distribution hard to read. Let's apply a log scale and log-transformed data and see if we get a better understanding.
#plot
scat_plot([12,9], df_videos, 'dislikes', 'comment_count', 1/5 ,
'Distribution of Dislikes vs Distribution of comment count(log scale)', 'log_scale')
#plot
scat_plot([12,9], df_videos, 'dislikes', 'comment_count', 1/5 ,
'Distribution of Dislikes vs Distribution of comment count(log data)', 'log_data')
Comment:
With the log scale and log-transformed data we can see that the relation between dislikes and comment count is positive.
Now let's check the relation between quantitative and qualitative variables. First we will write functions for box plots and violin plots. These plots give us the descriptive statistics and the density spread of the data.
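The same per-category statistics that a box plot draws (quartiles and median) can also be pulled numerically with a groupby. A small sketch on made-up data (`category_name` and `views` mirror the real columns):

```python
import pandas as pd

# Hypothetical mini-frame standing in for df_videos.
sample = pd.DataFrame({
    'category_name': ['Music', 'Music', 'Music', 'Sports', 'Sports', 'Sports'],
    'views': [100, 200, 900, 50, 60, 70],
})

# Quartiles per category -- the box edges and median line of a box plot.
stats = sample.groupby('category_name')['views'].quantile([0.25, 0.5, 0.75]).unstack()
print(stats)
```

The same expression on df_videos gives a numeric cross-check for the box plots that follow.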
def box_plot(figsize, data, x, y, angle, title, typee):
    if typee == 'horizontal':
        #figsize
        plt.figure(figsize= (figsize[0],figsize[1]))
        sns.boxplot(data = data, y = y, x = x, color = base_color)
        #set tick rotation
        plt.xticks(rotation= angle);
        #set axis labels and title
        plt.ylabel(y.upper())
        plt.xlabel(x.upper())
        plt.title(title);
    elif typee == 'vertical':
        #figsize
        plt.figure(figsize= (figsize[0],figsize[1]))
        sns.boxplot(data = data, y = x, x = y, color = base_color)
        #set tick rotation
        plt.xticks(rotation= angle);
        #set axis labels and title
        plt.xlabel(y.upper())
        plt.ylabel(x.upper())
        plt.title(title);
    elif typee == 'horizontal_log':
        #figsize
        plt.figure(figsize= (figsize[0],figsize[1]))
        sns.boxplot(data = data, y = y, x = x, color = base_color)
        #set tick rotation
        plt.xticks(rotation= angle);
        #set a log scale on the x axis
        plt.xscale('log');
        #set axis labels and title
        plt.ylabel(y.upper())
        plt.xlabel(x.upper())
        plt.title(title);
    elif typee == 'vertical_log':
        #figsize
        plt.figure(figsize= (figsize[0],figsize[1]))
        sns.boxplot(data = data, y = x, x = y, color = base_color)
        #set tick rotation
        plt.xticks(rotation= angle);
        #set a log scale on the y axis
        plt.yscale('log');
        #set axis labels and title
        plt.xlabel(y.upper())
        plt.ylabel(x.upper())
        plt.title(title);
    elif typee == 'horizontal_log_data':
        #figsize
        plt.figure(figsize= (figsize[0],figsize[1]))
        #log-transform the quantitative variable (+1 to handle zeros)
        x1= np.log(data[x]+1)
        sns.boxplot(data = data, y = y, x = x1, color = base_color)
        #set tick rotation
        plt.xticks(rotation= angle);
        #set axis labels and title
        plt.ylabel(y.upper())
        plt.xlabel(x.upper())
        plt.title(title);
    elif typee == 'vertical_log_data':
        #figsize
        plt.figure(figsize= (figsize[0],figsize[1]))
        #log-transform the quantitative variable (+1 to handle zeros)
        x1= np.log(data[x]+1)
        sns.boxplot(data = data, y = x1, x = y, color = base_color)
        #set tick rotation
        plt.xticks(rotation= angle);
        #set axis labels and title
        plt.xlabel(y.upper())
        plt.ylabel(x.upper())
        plt.title(title);
    else:
        print('please check typee')
def violin_plot(figsize, data, x, y, angle, title, typee):
    if typee == 'horizontal':
        #figsize
        plt.figure(figsize= (figsize[0],figsize[1]))
        sns.violinplot(data = data, y = y, x = x, color = base_color, inner = 'quartile')
        #set tick rotation
        plt.xticks(rotation= angle);
        #set axis labels and title
        plt.ylabel(y.upper())
        plt.xlabel(x.upper())
        plt.title(title);
    elif typee == 'vertical':
        #figsize
        plt.figure(figsize= (figsize[0],figsize[1]))
        sns.violinplot(data = data, y = x, x = y, color = base_color, inner = 'quartile')
        #set tick rotation
        plt.xticks(rotation= angle);
        #set axis labels and title
        plt.xlabel(y.upper())
        plt.ylabel(x.upper())
        plt.title(title);
    elif typee == 'horizontal_log':
        #figsize
        plt.figure(figsize= (figsize[0],figsize[1]))
        sns.violinplot(data = data, x = x, y = y, color = base_color, inner = 'quartile')
        #set tick rotation
        plt.xticks(rotation= angle);
        #set a log scale on the x axis
        plt.xscale('log');
        #set axis labels and title
        plt.ylabel(y.upper())
        plt.xlabel(x.upper())
        plt.title(title);
    elif typee == 'vertical_log':
        #figsize
        plt.figure(figsize= (figsize[0],figsize[1]))
        sns.violinplot(data = data, y = x, x = y, color = base_color, inner = 'quartile')
        #set tick rotation
        plt.xticks(rotation= angle);
        #set a log scale on the y axis
        plt.yscale('log');
        #set axis labels and title
        plt.xlabel(y.upper())
        plt.ylabel(x.upper())
        plt.title(title);
    elif typee == 'horizontal_log_data':
        #figsize
        plt.figure(figsize= (figsize[0],figsize[1]))
        #log-transform the quantitative variable (+1 to handle zeros)
        x1= np.log(data[x]+1)
        sns.violinplot(data = data, y = y, x = x1, color = base_color, inner = 'quartile')
        #set tick rotation
        plt.xticks(rotation= angle);
        #set axis labels and title
        plt.ylabel(y.upper())
        plt.xlabel(x.upper())
        plt.title(title);
    elif typee == 'vertical_log_data':
        #figsize
        plt.figure(figsize= (figsize[0],figsize[1]))
        #log-transform the quantitative variable (+1 to handle zeros)
        x1= np.log(data[x]+1)
        sns.violinplot(data = data, y = x1, x = y, color = base_color, inner = 'quartile')
        #set tick rotation
        plt.xticks(rotation= angle);
        #set axis labels and title
        plt.xlabel(y.upper())
        plt.ylabel(x.upper())
        plt.title(title);
    else:
        print('please check typee')
Let's check the relation between views and categories.
#plot
box_plot([15,12], df_videos, 'views', 'category_name' , 0 ,
'Distribution of Views over Categories', 'horizontal')
Comment:
The above plot shows the statistics of views for each category, but the interpretation is not clear: the scale is large and the plot looks compressed. Let's apply a log scale to get a better picture of the distribution of views across the categories.
#plot
box_plot([15,12], df_videos,'views','category_name', 0,
'Distribution of Views over Categories', 'horizontal_log')
Comment:
After applying the log scale we get a much more interpretable box plot with clear statistics. Note that on the log scale the outliers appear only above the maximum whisker. Let's see how the data behaves when we actually apply the log to the data itself rather than to the scale.
#plot
box_plot([15,12], df_videos,'views','category_name', 0,
'Distribution of Views over Categories', 'horizontal_log_data')
Comment:
After applying the log to the data itself, the plot shows outliers below the minimum whisker as well as above the maximum.
Let's now see the density spread of the views data over categories.
violin_plot([15,12], df_videos,'views','category_name', 0,
'Distribution of Views over Categories', 'horizontal')
Comment:
The above plot shows the density of views for each category, but the interpretation is unclear because of the large scale and because most of the data falls within a narrow range. Let's apply a log scale to get a better picture of the distribution of views across the categories.
violin_plot([15,12], df_videos,'views','category_name', 0,
'Distribution of Views over Categories(log scale)', 'horizontal_log')
Comment:
When applying a log scale, sometimes the axis does not show the full range of the data. In the above plot only the upper part of the data is visible. Let's see if we get a clearer picture when we apply the log to the data itself rather than to the scale.
violin_plot([15,12], df_videos,'views','category_name', 0,
'Distribution of Views over Categories(log data)', 'horizontal_log_data')
Comment:
From the plots above we can see that the log scale does not give us a proper plot, but the log-transformed data gives a much better picture of the density spread. The data spans a large scale, with low density near the minimum and maximum data points and higher density in between.
Let's now see the distribution of likes over categories to check the density of the data spread.
violin_plot([15,5], df_videos,'likes', 'category_name', 45,
            'Distribution of likes over Categories', 'vertical')
Comment:
The above plot shows the density of likes for each category, but the interpretation is not clear: the data spans a large scale and the plot looks compressed. Let's apply a log scale to see if we can get a better picture of the distribution of likes across the categories.
violin_plot([15,5], df_videos,'likes', 'category_name', 45,
            'Distribution of likes over Categories (log scale)', 'vertical_log')
Comment:
When a log scale is applied, the axis sometimes does not cover the complete range of the data; in the plot above only the upper part of the data is visible. Let's see if we get a clearer picture when we actually apply the log to the data rather than to the scale.
violin_plot([15,5], df_videos,'likes', 'category_name', 45,
'Distribution of likes over Categories(log data)', 'vertical_log_data')
Comment:
The plot of the log-transformed data shows the distribution of likes over categories. Most categories show relatively high variance, but 'Sports', 'Comedy', 'Education', 'Pets & Animals' and 'Shows' are more concentrated than the others. Bimodality is also suggested in the 'Shows' and 'Travel & Events' categories.
Let's go ahead and check the descriptive statistics of likes over categories.
box_plot([15,5], df_videos,'likes', 'category_name', 45,
         'Distribution of likes over Categories', 'vertical')
Comment:
The above plot should give us an understanding of the descriptive statistics of likes per category, but because of the large scale of the data the statistics are not clearly visible. Let's see if we get a better picture with a log scale applied.
box_plot([15,12], df_videos,'likes', 'category_name', 45,
         'Distribution of likes over Categories (log scale)', 'horizontal_log')
Comment:
With the log scale applied we definitely get a better picture of the statistics, but not of the complete data: the minimum values are missing from the plot, an artifact of the log scale. Let's see if we get a better picture when we plot log-transformed data.
box_plot([15,12], df_videos,'likes', 'category_name', 45,
'Distribution of likes over Categories(log data)', 'horizontal_log_data')
Comment:
The plot of the log-transformed data gives a complete picture of the descriptive statistics of likes over categories. The data has outliers both below the lower whisker and above the upper whisker. Some categories are negatively skewed, such as Shows, Gaming, Nonprofits & Activism, Entertainment, Science & Technology, and Travel & Events.
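The skew a box plot suggests can also be quantified with pandas' sample skewness. A sketch on synthetic data (the category names and counts here are made up; on the real data this would group `df_videos` by `category_name`):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# made-up likes for two hypothetical categories
df = pd.DataFrame({
    'category_name': ['Shows'] * 300 + ['Music'] * 300,
    'likes': np.concatenate([
        rng.lognormal(mean=8, sigma=1.0, size=300),
        rng.lognormal(mean=10, sigma=2.0, size=300),
    ]).round().astype(int),
})

# sample skewness of log1p(likes) per category:
# < 0 suggests a left (negative) skew, > 0 a right (positive) skew
log_skew = np.log1p(df['likes']).groupby(df['category_name']).skew()
print(log_skew)
```

A numeric check like this is a useful companion to the visual read, since skew in a log-transformed violin or box plot is easy to over-interpret by eye.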
Let us now go ahead and check the density spread of dislikes data over categories
violin_plot([15,5], df_videos,'dislikes', 'category_name', 45,
            'Distribution of dislikes over Categories', 'vertical')
Comment:
The above plot shows the density of dislikes for each category, but the interpretation is not clear: the data spans a large scale and most of it is concentrated in a narrow range, so the plot looks compressed. Let's apply a log scale to see if we can get a better picture of the distribution of dislikes across the categories.
violin_plot([15,5], df_videos,'dislikes', 'category_name', 45,
            'Distribution of dislikes over Categories (log scale)', 'vertical_log')
Comment:
When a log scale is applied, the axis sometimes does not cover the complete range of the data; in the plot above only the upper part of the data is visible. Let's see if we get a clearer picture when we actually apply the log to the data rather than to the scale.
violin_plot([15,12], df_videos,'dislikes', 'category_name', 45,
'Distribution of dislikes over Categories(log data)', 'horizontal_log_data')
Comment:
From the plots above we can see that the log scale does not give us a proper plot, but the log-transformed data gives a much better picture of the density spread. The data spans a large scale, with low density near the minimum and maximum data points and higher density in between. While most categories show relatively high variance, 'Sports', 'Education' and 'Shows' are more concentrated than the others. Bimodality is also suggested in the 'Shows' and 'Pets & Animals' categories.
Let's now see the descriptive statistics of dislikes over categories.
box_plot([15,5], df_videos,'dislikes', 'category_name', 45,
         'Distribution of dislikes over Categories', 'vertical')
Comment:
The above plot should give us an understanding of the descriptive statistics of dislikes per category, but because of the large scale of the data the statistics are not clearly visible. Let's see if we get a better picture with a log scale applied.
box_plot([15,12], df_videos,'dislikes', 'category_name', 45,
         'Distribution of dislikes over Categories (log scale)', 'horizontal_log')
Comment:
With the log scale applied we definitely get a better picture of the statistics, but not of the complete data: the minimum values are missing from the plot, an artifact of the log scale. Let's see if we get a better picture when we plot log-transformed data.
box_plot([15,12], df_videos,'dislikes', 'category_name', 45,
'Distribution of dislikes over Categories(log data)', 'horizontal_log_data')
Comment:
The plot of the log-transformed data gives a complete picture of the descriptive statistics of dislikes over categories. For some categories the data has outliers both below the lower whisker and above the upper whisker. Some categories are negatively skewed, such as Shows, Gaming, Entertainment, Science & Technology, and Travel & Events, while Nonprofits & Activism is positively skewed.
Let us now go ahead and check the density spread of the comment count data over categories.
violin_plot([15,5], df_videos,'comment_count', 'category_name', 45,
            'Distribution of comment_count over Categories', 'vertical')
Comment:
The above plot shows the density of the comment count for each category, but the interpretation is not clear: the data spans a large scale and most of it is concentrated in a narrow range, so the plot looks compressed. Let's apply a log scale to see if we can get a better picture of the distribution of the comment count across the categories.
violin_plot([15,5], df_videos,'comment_count', 'category_name', 45,
            'Distribution of comment_count over Categories (log scale)', 'vertical_log')
Comment:
When a log scale is applied, the axis sometimes does not cover the complete range of the data; in the plot above only the upper part of the data is visible. Let's see if we get a clearer picture when we actually apply the log to the data rather than to the scale.
violin_plot([15,12], df_videos,'comment_count', 'category_name', 45,
'Distribution of comment_count over Categories(log data)', 'horizontal_log_data')
Comment:
From the plots above we can see that the log scale does not give us a proper plot, but the log-transformed data gives a much better picture of the density spread. The data spans a large scale, with low density near the minimum and maximum data points and higher density in between. While most categories show relatively high variance, 'Education' and 'Shows' are more concentrated than the others. Bimodality is also suggested in the 'Shows' category.
Let's now see the descriptive statistics of the comment count over categories.
box_plot([15,5], df_videos,'comment_count', 'category_name', 45,
         'Distribution of comment_count over Categories', 'vertical')
Comment:
The above plot should give us an understanding of the descriptive statistics of the comment count per category, but because of the large scale of the data the statistics are not clearly visible. Let's see if we get a better picture with a log scale applied.
box_plot([15,12], df_videos,'comment_count', 'category_name', 45,
         'Distribution of comment_count over Categories (log scale)', 'horizontal_log')
Comment:
With the log scale applied we definitely get a better picture of the statistics, but not of the complete data: the minimum values are missing from the plot, an artifact of the log scale. Let's see if we get a better picture when we plot log-transformed data.
box_plot([15,12], df_videos,'comment_count', 'category_name', 45,
'Distribution of comment_count over Categories(log data)', 'horizontal_log_data')
Comment:
From the above plot we get a complete picture of the descriptive statistics of the comment count over categories. For some categories the data has outliers both below the lower whisker and above the upper whisker. Some categories are negatively skewed, such as Shows, Gaming, Autos & Vehicles, Sports, and Science & Technology, while Nonprofits & Activism, Travel & Events, Pets & Animals, Education and a few others are positively skewed.
Let us now go ahead and check the density spread of the views data over channels
# We have many channels, so let's only consider the top 10 channels by number of published trending videos
df_top10_chn = df_videos[df_videos['channel_title'].isin(['ESPN', 'TheEllenShow',
'The Tonight Show Starring Jimmy Fallon',
'Jimmy Kimmel Live', 'Netflix',
'The Late Show with Stephen Colbert',
'NBA', 'CNN', 'Vox',
'The Late Late Show with James Corden'])]
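The hand-typed list above appears to be the ten channels with the most trending rows; if so, it can be derived from the data instead of typed out. A sketch on a synthetic frame (the channel counts here are hypothetical):

```python
import pandas as pd

# synthetic stand-in for df_videos (only the column that matters here)
demo = pd.DataFrame({
    'channel_title': ['ESPN'] * 5 + ['Vox'] * 4 + ['CNN'] * 3 + ['NBA'] * 2
})

# value_counts() counts trending rows per channel, sorted descending;
# head(10).index keeps the 10 most frequent channel names
top10 = demo['channel_title'].value_counts().head(10).index

# same .isin() filter as above, but driven by the computed list
df_top10_demo = demo[demo['channel_title'].isin(top10)]
print(list(top10))
```

Deriving the list this way keeps the subset correct if the underlying CSV is ever refreshed.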
violin_plot([15,12], df_top10_chn,'views', 'channel_title', 0,
'Distribution of views over top 10 channels', 'horizontal')
Comment:
The above plot shows the density spread of the data for the top 10 channels. Here the data is easier to read, since it is only a subset of the full data. Some channels show relatively high variance, but NBA, Vox and ESPN are more concentrated than the other channels. Bimodality is also suggested for the Vox channel. Let's see how the data behaves when we plot log-transformed data.
violin_plot([15,12], df_top10_chn,'views', 'channel_title', 0,
'Distribution of views over top 10 channels(log data)', 'horizontal_log_data')
Comment:
The log-transformed data gives quite a different view of the density spread compared to the untransformed data. Here Vox is more concentrated than the other channels, which show relatively high variance, and Vox is also the channel that suggests bimodality.
Let us now go ahead and check the descriptive statistics of the data
box_plot([15,12], df_top10_chn,'views', 'channel_title', 0,
'Distribution of views over top 10 channels', 'horizontal')
Comment:
From the above plot we get an insight into the descriptive statistics of the data. The plot shows many outliers above the upper whisker, and most of the data is positively skewed. Let's check how the data behaves with a log transformation.
box_plot([15,12], df_top10_chn,'views', 'channel_title', 0,
'Distribution of views over top 10 channels(log data)', 'horizontal_log_data')
Comment:
The log-transformed data shows different statistics than the original data. Here only a few channels show outliers, both below the lower whisker and above the upper whisker. Some of the data is positively skewed and some negatively skewed.
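The "outliers" these box plots flag follow Tukey's 1.5×IQR rule, the default whisker convention in seaborn and matplotlib. A minimal sketch on made-up view counts shows why raw counts tend to flag only the top end while log-transformed values need not:

```python
import numpy as np
import pandas as pd

# synthetic, right-skewed "views" (hypothetical numbers)
views = pd.Series([80, 900, 4_000, 12_000, 35_000, 90_000, 400_000, 2_500_000])

def tukey_outliers(s):
    """Points outside Q1 - 1.5*IQR or Q3 + 1.5*IQR -- the dots a box plot draws."""
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    return s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]

# On the raw counts only the extreme top value is flagged ...
print(len(tukey_outliers(views)))            # flags the 2,500,000-view video

# ... while on log-transformed values the fences are recomputed on the log
# scale; here nothing is flagged at all, because the log transform pulls in
# the heavy right tail.
print(len(tukey_outliers(np.log1p(views))))
```

So the same videos can be outliers in one variant of the plot and ordinary points in the other; the rule is applied to whatever values are actually plotted.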
Let's go ahead and check the density spread of likes over channels.
violin_plot([15,12], df_top10_chn,'likes', 'channel_title', 0,
'Distribution of likes over top 10 channels', 'horizontal')
Comment:
The above plot shows the density spread of the data for the top 10 channels. Here we can read part of the data, since it is only a subset of the full data. Some channels show relatively high variance, but NBA, Vox and ESPN are more concentrated than the other channels. Let's see how the data behaves when we plot log-transformed data.
violin_plot([15,12], df_top10_chn,'likes', 'channel_title', 0,
'Distribution of likes over top 10 channels (log data)', 'horizontal_log_data')
Comment:
The log-transformed data gives quite a different view of the density spread compared to the untransformed data. Here Vox and NBA are more concentrated than the other channels, which show relatively high variance.
Let us now go ahead and check the descriptive statistics of the data
box_plot([15,12], df_top10_chn,'likes', 'channel_title', 0,
'Distribution of likes over top 10 channels', 'horizontal')
Comment:
From the above plot we get an insight into the descriptive statistics of the data. The plot shows many outliers above the upper whisker, and most of the data is positively skewed. Let's check how the data behaves with a log transformation.
box_plot([15,12], df_top10_chn,'likes', 'channel_title', 0,
'Distribution of likes over top 10 channels (log data)', 'horizontal_log_data')
Comment:
The log-transformed data shows different statistics than the original data. Here only a few channels show outliers, both below the lower whisker and above the upper whisker. Some of the data is positively skewed and some negatively skewed.
Let's go ahead and check the density spread of dislikes over channels.
violin_plot([15,12], df_top10_chn,'dislikes', 'channel_title', 0,
'Distribution of dislikes over top 10 channels', 'horizontal')
Comment:
The above plot shows the density spread of the data for the top 10 channels. Here the data is easier to read, since it is only a subset of the full data. Some channels show relatively high variance, while others are more concentrated. Bimodality is also suggested for some channels, such as NBA. Let's see how the data behaves when we plot log-transformed data.
violin_plot([15,12], df_top10_chn,'dislikes', 'channel_title', 0,
'Distribution of dislikes over top 10 channels (log data)', 'horizontal_log_data')
Comment:
The log-transformed data gives quite a different view of the density spread compared to the untransformed data. Some channels show relatively high variance, while others are more concentrated. NBA again suggests bimodality.
Let us now go ahead and check the descriptive statistics of the data
box_plot([15,12], df_top10_chn,'dislikes', 'channel_title', 0,
'Distribution of dislikes over top 10 channels', 'horizontal')
Comment:
From the above plot we get an insight into the descriptive statistics of the data. The plot shows many outliers above the upper whisker, and most of the data is positively skewed. Let's check how the data behaves with a log transformation.
box_plot([15,12], df_top10_chn,'dislikes', 'channel_title', 0,
'Distribution of dislikes over top 10 channels (log data)', 'horizontal_log_data')
Comment:
The log-transformed data shows different statistics than the original data. Here only a few channels show outliers, both below the lower whisker and above the upper whisker, and most of the data is positively skewed.
Let's go ahead and check the density spread of comment_count over channels.
violin_plot([15,12], df_top10_chn,'comment_count', 'channel_title', 0,
'Distribution of comment_count over top 10 channels', 'horizontal')
Comment:
The above plot shows the density spread of the data for the top 10 channels. Here we can read part of the data, since it is only a subset of the full data. Some channels show relatively high variance, but NBA is more concentrated than the other channels. Let's see how the data behaves when we plot log-transformed data.
violin_plot([15,12], df_top10_chn,'comment_count', 'channel_title', 0,
'Distribution of comment_count over top 10 channels (log data)', 'horizontal_log_data')
Comment:
The log-transformed data gives quite a different view of the density spread compared to the untransformed data. Here Vox and The Ellen Show suggest bimodality.
Let us now go ahead and check the descriptive statistics of the data
box_plot([15,12], df_top10_chn,'comment_count', 'channel_title', 0,
'Distribution of comment_count over top 10 channels', 'horizontal')
Comment:
From the above plot we get an insight into the descriptive statistics of the data. The plot shows many outliers above the upper whisker, and most of the data is positively skewed. Let's check how the data behaves with a log transformation.
box_plot([15,12], df_top10_chn,'comment_count', 'channel_title', 0,
'Distribution of comment_count over top 10 channels (log data)', 'horizontal_log_data')
Comment:
The log-transformed data shows different statistics than the original data. Here only a few channels show outliers, both below the lower whisker and above the upper whisker. Most of the data is positively skewed and some is negatively skewed.
Let's go ahead and check the density spread of views over months.
violin_plot([15,12], df_videos,'views', 'month', 0,
'Distribution of views over months', 'horizontal')
Comment:
The above plot shows the density of the data for each month, but the interpretation is not clear because of the large scale of the data. Let's apply a log transformation to see if we can get a better picture of the distribution of the data across months.
violin_plot([15,12], df_videos,'views', 'month', 0,
'Distribution of views over months (log data)', 'horizontal_log_data')
Comment:
The log-transformed data gives a much better picture of the data density. Here the summer months are more concentrated than the rest, and February suggests bimodality.
Let's check the descriptive statistics of the data.
box_plot([15,12], df_videos,'views', 'month', 0,
'Distribution of views over months', 'horizontal')
Comment:
The above plot should give us an understanding of the descriptive statistics of views over months, but because of the large scale of the data the statistics are not clearly visible. Let's see if we get a better picture with log-transformed data.
box_plot([15,12], df_videos,'views', 'month', 0,
'Distribution of views over months (log data)', 'horizontal_log_data')
Comment:
From the above plot we get a complete picture of the descriptive statistics of the data. For some months the data has outliers both below the lower whisker and above the upper whisker. Some of the data is negatively skewed and some positively skewed.
Let us now go ahead and check the density spread of the likes data over months.
violin_plot([15,12], df_videos,'likes', 'month', 0,
'Distribution of likes over months', 'horizontal')
Comment:
The above plot shows the density of the data for each month, but the interpretation is not clear because of the large scale of the data. Let's apply a log transformation to see if we can get a better picture of the distribution of the data across months.
violin_plot([15,12], df_videos,'likes', 'month', 0,
'Distribution of likes over months (log data)', 'horizontal_log_data')
Comment:
The log-transformed data gives a much better picture of the data density. Here the summer months are more concentrated than the rest.
Let's check the descriptive statistics of the data.
box_plot([15,12], df_videos,'likes', 'month', 0,
'Distribution of likes over months', 'horizontal')
Comment:
The above plot should give us an understanding of the descriptive statistics of likes over months, but because of the large scale of the data the statistics are not clearly visible. Let's see if we get a better picture with log-transformed data.
box_plot([15,12], df_videos,'likes', 'month', 0,
'Distribution of likes over months (log data)', 'horizontal_log_data')
Comment:
From the above plot we get a complete picture of the descriptive statistics of the data. For some months the data has outliers both below the lower whisker and above the upper whisker. Some of the data is negatively skewed and some positively skewed.
Let us now go ahead and check the density spread of the dislikes data over months.
violin_plot([15,12], df_videos,'dislikes', 'month', 0,
'Distribution of dislikes over months', 'horizontal')
Comment:
The above plot shows the density of the data for each month, but the interpretation is not clear because of the large scale of the data. Let's apply a log transformation to see if we can get a better picture of the distribution of the data across months.
violin_plot([15,12], df_videos,'dislikes', 'month', 0,
'Distribution of dislikes over months (log data)', 'horizontal_log_data')
Comment:
The log-transformed data gives a much better picture of the data density. Here September is more concentrated than the rest of the months, and July suggests bimodality.
Let's check the descriptive statistics of the data.
box_plot([15,12], df_videos,'dislikes', 'month', 0,
'Distribution of dislikes over months', 'horizontal')
Comment:
The above plot should give us an understanding of the descriptive statistics of dislikes over months, but because of the large scale of the data the statistics are not clearly visible. Let's see if we get a better picture with log-transformed data.
box_plot([15,12], df_videos,'dislikes', 'month', 0,
'Distribution of dislikes over months (log data)', 'horizontal_log_data')
Comment:
From the above plot we get a complete picture of the descriptive statistics of the data. For some months the data has outliers both below the lower whisker and above the upper whisker. Some of the data is negatively skewed and some positively skewed.
Let us now go ahead and check the density spread of the comment count data over months.
violin_plot([15,12], df_videos,'comment_count', 'month', 0,
'Distribution of comment_count over months', 'horizontal')
Comment:
The above plot shows the density of the data for each month, but the interpretation is not clear because of the large scale of the data. Let's apply a log transformation to see if we can get a better picture of the distribution of the data across months.
violin_plot([15,12], df_videos,'comment_count', 'month', 0,
'Distribution of comment_count over months (log data)', 'horizontal_log_data')
Comment:
The log-transformed data gives a much better picture of the data density. Here the summer months are more concentrated than the rest, and February suggests bimodality.
Let's check the descriptive statistics of the data.
box_plot([15,12], df_videos,'comment_count', 'month', 0,
'Distribution of comment_count over months', 'horizontal')
Comment:
The above plot should give us an understanding of the descriptive statistics of the comment count over months, but because of the large scale of the data the statistics are not clearly visible. Let's see if we get a better picture with log-transformed data.
box_plot([15,12], df_videos,'comment_count', 'month', 0,
'Distribution of comment_count over months (log data)', 'horizontal_log_data')
Comment:
From the above plot we get a complete picture of the descriptive statistics of the data. For some months the data has outliers both below the lower whisker and above the upper whisker. Some of the data is negatively skewed and some positively skewed.
Let us now go ahead and check the density spread of the views data over days.
violin_plot([15,12], df_videos,'views', 'day', 0,
'Distribution of views over days', 'horizontal')
Comment:
The above plot helps us get an understanding of the density spread of the data. Here Monday, Saturday and Tuesday are more concentrated than the other days. Let's check how the data behaves when a log transformation is applied.
violin_plot([15,12], df_videos,'views', 'day', 0,
'Distribution of views over days (log data)', 'horizontal_log_data')
Comment:
The log-transformed data gives quite a different view of the density spread compared to the untransformed data. It suggests that all the days have a similar density of data spread.
Let's check the descriptive statistics of the data.
box_plot([15,12], df_videos,'views', 'day', 0,
'Distribution of views over days', 'horizontal')
Comment:
The above plot should give us an understanding of the descriptive statistics of the data, but because of the large scale of the data the statistics are not clearly visible. Let's see if we get a better picture with log-transformed data.
box_plot([15,8], df_videos,'views', 'day', 0,
'Distribution of views over days (log data)', 'horizontal_log_data')
Comment:
From the above plot we get a complete picture of the descriptive statistics of the data. The data has outliers both below the lower whisker and above the upper whisker. Some of the data is negatively skewed and some shows little skew.
Let us now go ahead and check the density spread of the likes data over days.
violin_plot([15,12], df_videos,'likes', 'day', 0,
'Distribution of likes over days', 'horizontal')
Comment:
The above plot helps us get an understanding of the density spread of the data. Here Monday, Saturday and Tuesday are more concentrated than the other days. Let's check how the data behaves when a log transformation is applied.
violin_plot([15,12], df_videos,'likes', 'day', 0,
'Distribution of likes over days (log data)', 'horizontal_log_data')
Comment:
The log-transformed data gives quite a different view of the density spread compared to the untransformed data. It suggests that all the days have a similar density of data spread.
Let's check the descriptive statistics of the data.
box_plot([15,8], df_videos,'likes', 'day', 0,
'Distribution of likes over days', 'horizontal')
Comment:
The above plot should give us an understanding of the descriptive statistics of the data, but because of the large scale of the data the statistics are not clearly visible. Let's see if we get a better picture with log-transformed data.
box_plot([15,8], df_videos,'likes', 'day', 0,
'Distribution of likes over days (log data)', 'horizontal_log_data')
Comment:
From the above plot we get a complete picture of the descriptive statistics of the data. The data has many outliers below the lower whisker as well as some above the upper whisker, and most of the data is negatively skewed.
Let us now go ahead and check the density spread of the dislikes data over days.
violin_plot([15,12], df_videos,'dislikes', 'day', 0,
'Distribution of dislikes over days', 'horizontal')
Comment:
The above plot helps us get an understanding of the density spread of the data, but because of the large scale of the data the density is not clearly visible. Let's see if we get a better picture with log-transformed data.
violin_plot([15,12], df_videos,'dislikes', 'day', 0,
'Distribution of dislikes over days (log data)', 'horizontal_log_data')
Comment:
The log-transformed data gives quite a different view of the density spread compared to the untransformed data. It suggests that Monday, Saturday and Tuesday are slightly more concentrated than the other days, though without much difference.
Let's check the descriptive statistics of the data.
box_plot([15,8], df_videos,'dislikes', 'day', 0,
'Distribution of dislikes over days', 'horizontal')
Comment:
The above plot should give us an understanding of the descriptive statistics of the data, but because of the large scale of the data the statistics are not clearly visible. Let's see if we get a better picture with log-transformed data.
box_plot([15,8], df_videos,'dislikes', 'day', 0,
'Distribution of dislikes over days (log data)', 'horizontal_log_data')
Comment:
From the above plot we get a complete picture of the descriptive statistics of the data. The data has outliers both below the lower whisker and above the upper whisker. Some of the data is negatively skewed and some positively skewed.
Let us now go ahead and check the density spread of the comment count data over days.
violin_plot([15,12], df_videos,'comment_count', 'day', 0,
'Distribution of comment_count over days', 'horizontal')
Comment:
The above plot helps us get an understanding of the density spread of the data. Here Monday is more concentrated than the other days. Let's check how the data behaves when a log transformation is applied.
violin_plot([15,12], df_videos,'comment_count', 'day', 0,
'Distribution of comment_count over days (log data)', 'horizontal_log_data')
Comment:
The log-transformed data gives quite a different view of the density spread compared to the untransformed data. It suggests that all the days have almost the same density of data spread.
Let's check the descriptive statistics of the data.
box_plot([15,8], df_videos,'comment_count', 'day', 0,
'Distribution of comment_count over days', 'horizontal')
Comment:
The above plot should give us an understanding of the descriptive statistics of the data, but because of the large scale of the data the statistics are not clearly visible. Let's see if we get a better picture with log-transformed data.
box_plot([15,8], df_videos,'comment_count', 'day', 0,
'Distribution of comment_count over days (log data)', 'horizontal_log_data')
Comment:
From the above plot we get a complete picture of the descriptive statistics of the data. The data has outliers both below the lower whisker and above the upper whisker. Some of the data is negatively skewed and some positively skewed.
Let us now go ahead and check the density spread of the views data over hours.
violin_plot([18,7], df_videos,'views', 'hour', 0,
'Distribution of views over hour of the day', 'vertical')
Comment:
The above plot helps us get an understanding of the density spread of the data. The plot is not very clear, but it suggests that the density at hour 19 (24-hour clock) is higher than at the rest of the hours. Let's see how the data looks with a log transformation.
violin_plot([18,7], df_videos,'views', 'hour', 0,
'Distribution of views over hour of the day (log data)', 'vertical_log_data')
Comment:
The above plot shows the log-transformed data, which gives a quite different picture than the untransformed data: it suggests that all hours have a similar density spread.
Let's check the descriptive statistics of the data.
box_plot([18,7], df_videos,'views', 'hour', 0,
'Distribution of views over hour of the day', 'vertical')
Comment:
The above plot should help us understand the descriptive statistics of the data, but because of the large scale of the values we don't get a clear picture. Let's see whether the log-transformed data gives a better one.
box_plot([18,7], df_videos,'views', 'hour', 0,
'Distribution of views over hour of the day (log data)', 'vertical_log_data')
Comment:
The above plot gives a complete picture of the descriptive statistics of the data. We can see outliers below the lower whisker as well as above the upper whisker, and that some of the distributions are negatively skewed while others are positively skewed.
Let us now check the density spread of the likes data over hours of the day.
violin_plot([18,7], df_videos,'likes', 'hour', 0,
'Distribution of likes over hour of the day', 'vertical')
Comment:
The above plot helps us understand the density spread of the data. It is not very clear, but it suggests that the density at hours 0 and 23 (24-hour time) is higher than at the other hours. Let's see how the data looks with a log transformation.
violin_plot([18,7], df_videos,'likes', 'hour', 0,
'Distribution of likes over hour of the day (log data)', 'vertical_log_data')
Comment:
The above plot shows the log-transformed data, which gives a quite different picture than the untransformed data: it suggests that all hours have a similar density spread, with relatively high variance.
Let's check the descriptive statistics of the data.
box_plot([18,7], df_videos,'likes', 'hour', 0,
'Distribution of likes over hour of the day', 'vertical')
Comment:
The above plot should help us understand the descriptive statistics of the data, but because of the large scale of the values we don't get a clear picture. Let's see whether the log-transformed data gives a better one.
box_plot([18,7], df_videos,'likes', 'hour', 0,
'Distribution of likes over hour of the day (log data)', 'vertical_log_data')
Comment:
The above plot gives a complete picture of the descriptive statistics of the data. We can see outliers below the lower whisker as well as above the upper whisker, and that some of the distributions are negatively skewed while others are positively skewed.
Let us now check the density spread of the dislikes data over hours of the day.
violin_plot([18,7], df_videos,'dislikes', 'hour', 0,
'Distribution of dislikes over hour of the day', 'vertical')
Comment:
The above plot helps us understand the density spread of the data. It is not very clear, but it suggests that the density at hours 2, 6 and 15 (24-hour time) is higher than at the other hours. Let's see how the data looks with a log transformation.
violin_plot([18,7], df_videos,'dislikes', 'hour', 0,
'Distribution of dislikes over hour of the day (log data)','vertical_log_data')
Comment:
The above plot shows the log-transformed data, which gives a quite different picture than the untransformed data: it suggests that all hours have a similar density spread, with relatively high variance.
Let's check the descriptive statistics of the data.
box_plot([18,7], df_videos,'dislikes', 'hour', 0,
'Distribution of dislikes over hour of the day', 'vertical')
Comment:
The above plot should help us understand the descriptive statistics of the data, but because of the large scale of the values we don't get a clear picture. Let's see whether the log-transformed data gives a better one.
box_plot([18,7], df_videos,'dislikes', 'hour', 0,
'Distribution of dislikes over hour of the day (log data)', 'vertical_log_data')
Comment:
The above plot gives a complete picture of the descriptive statistics of the data. We can see outliers below the lower whisker as well as above the upper whisker, and that some of the distributions are negatively skewed while others are positively skewed.
Let us now check the density spread of the comment count data over hours of the day.
violin_plot([18,7], df_videos,'comment_count', 'hour', 0,
'Distribution of comment_count over hour of the day', 'vertical')
Comment:
The above plot helps us understand the density spread of the data. It is not very clear, but it suggests that the density at hour 6 (24-hour time) is higher than at the other hours. Let's see how the data looks with a log transformation.
violin_plot([18,7], df_videos,'comment_count', 'hour', 0,
'Distribution of comment_count over hour of the day (log data)', 'vertical_log_data')
Comment:
The above plot shows the log-transformed data, which gives a quite different picture than the untransformed data: it suggests that all hours have a similar density spread, with relatively high variance.
Let's check the descriptive statistics of the data.
box_plot([18,7], df_videos,'comment_count', 'hour', 0,
'Distribution of comment_count over hour of the day', 'vertical')
Comment:
The above plot should help us understand the descriptive statistics of the data, but because of the large scale of the values we don't get a clear picture. Let's see whether the log-transformed data gives a better one.
box_plot([18,7], df_videos,'comment_count', 'hour', 0,
'Distribution of comment_count over hour of the day (log data)', 'vertical_log_data')
Comment:
The above plot gives a complete picture of the descriptive statistics of the data. We can see outliers below the lower whisker as well as above the upper whisker, and that some of the distributions are negatively skewed while others are positively skewed.
Let us now check the trend lines of the variables over months, days and hours.
#function for line plot
def line_plot(figsize, data, x, y, angle, title, typee):
    if typee == 'normal':
        plt.figure(figsize = (figsize[0], figsize[1]))
        #plot
        sns.lineplot(y = y, x = x, data = data)
        #set axis labels and title
        plt.xticks(rotation=angle)
        plt.ylabel(y.upper())
        plt.xlabel(x.upper())
        plt.title(title);
    elif typee == 'normal_log':
        plt.figure(figsize = (figsize[0], figsize[1]))
        #plot the log-transformed variable; log(x + 1) keeps zero counts defined
        y1 = np.log(data[y] + 1)
        sns.lineplot(y = y1, x = x, data = data)
        #set axis labels and title
        plt.xticks(rotation=angle)
        plt.ylabel(y.upper())
        plt.xlabel(x.upper())
        plt.title(title);
    else:
        print('please check typee')
line_plot([12, 5], df_videos, 'month_num', 'views', 45,
'Trend Line of views over months', 'normal')
Comment:
Line plots provide a trend line showing how a variable moves (increases or decreases) over time. Here we can see that views increase in summer, drop sharply from June to October, and tend to increase again in winter.
line_plot([12, 5], df_videos, 'month_num', 'views', 45,
'Trend Line of views over months(log data)', 'normal_log')
Comment:
On the log-transformed data, views increase in summer, then drop and fluctuate from June to October, and tend to increase again in winter.
line_plot([12, 5], df_videos, 'month_num', 'likes', 45,
'Trend Line of likes over months', 'normal')
Comment:
The same pattern holds for likes: they increase in summer, drop sharply from June to October, and tend to increase again in winter.
line_plot([12, 5], df_videos, 'month_num', 'likes', 45,
'Trend Line of likes over months(log data)', 'normal_log')
Comment:
On the log-transformed data, likes increase in summer, then drop and fluctuate from June to October, and tend to increase again in winter.
line_plot([12, 5], df_videos, 'month_num', 'dislikes', 45,
'Trend Line of dislikes over months', 'normal')
Comment:
Dislikes follow the same pattern: increasing in summer, dropping sharply from June to October, and increasing again in winter.
line_plot([12, 5], df_videos, 'month_num', 'dislikes', 45,
'Trend Line of dislikes over months(log data)', 'normal_log')
Comment:
On the log-transformed data, dislikes increase in summer, then drop and fluctuate from June to October, and tend to increase again in winter.
line_plot([12, 5], df_videos, 'month_num', 'comment_count', 45,
'Trend Line of comment_count over months', 'normal')
Comment:
Comment counts also increase in summer, drop sharply from June to October, and tend to increase again in winter.
line_plot([12, 5], df_videos, 'month_num', 'comment_count', 45,
'Trend Line of comment_count over months(log data)', 'normal_log')
Comment:
On the log-transformed data, comment counts increase in summer, then drop and fluctuate from June to October, and tend to increase again in winter.
line_plot([12, 5], df_videos, 'day', 'views', 45,
'Trend Line of views over days', 'normal')
Comment:
The above plot shows the trend line of views across days. Let's check the trend line for the log data.
line_plot([12, 5], df_videos, 'day', 'views', 45,
'Trend Line of views over days(log data)', 'normal_log')
Comment:
The above plot shows the trend line for the log-transformed data; it is similar to the untransformed trend line.
line_plot([12, 5], df_videos, 'day', 'likes', 45,
'Trend Line of likes over days', 'normal')
Comment:
The above plot shows the trend line of likes across days. Let's check the trend line for the log data.
line_plot([12, 5], df_videos, 'day', 'likes', 45,
'Trend Line of likes over days(log data)', 'normal_log')
Comment:
The above plot shows the trend line for the log-transformed data; it differs a little from the untransformed trend line.
line_plot([12, 5], df_videos, 'day', 'dislikes', 45,
'Trend Line of dislikes over days', 'normal')
Comment:
The above plot shows the trend line of dislikes across days. Let's check the trend line for the log data.
line_plot([12, 5], df_videos, 'day', 'dislikes', 45,
'Trend Line of dislikes over days(log data)', 'normal_log')
Comment:
The above plot shows the trend line for the log-transformed data; it is very different from the untransformed trend line.
line_plot([12, 5], df_videos, 'day', 'comment_count', 45,
'Trend Line of comment_count over days', 'normal')
Comment:
The above plot shows the trend line of comment counts across days. Let's check the trend line for the log data.
line_plot([12, 5], df_videos, 'day', 'comment_count', 45,
'Trend Line of comment_count over days(log data)', 'normal_log')
Comment:
The above plot shows the trend line for the log-transformed data; it is very different from the untransformed trend line.
line_plot([12, 5], df_videos, 'hour', 'views', 45,
'Trend Line of views over hour of the day', 'normal')
Comment:
The above plot shows the trend line of views across hours of the day. Let's check the trend line for the log data.
line_plot([12, 5], df_videos, 'hour', 'views', 45,
'Trend Line of views over hour of the day(log data)', 'normal_log')
Comment:
The above plot shows the trend line for the log-transformed data; it fluctuates much more than the untransformed trend line.
line_plot([12, 5], df_videos, 'hour', 'likes', 45,
'Trend Line of likes over hour of the day', 'normal')
Comment:
The above plot shows the trend line of likes across hours of the day. Let's check the trend line for the log data.
line_plot([12, 5], df_videos, 'hour', 'likes', 45,
'Trend Line of likes over hour of the day(log data)', 'normal_log')
Comment:
The above plot shows the trend line for the log-transformed data; it fluctuates much more than the untransformed trend line.
line_plot([12, 5], df_videos, 'hour', 'dislikes', 45,
'Trend Line of dislikes over hour of the day', 'normal')
Comment:
The above plot shows the trend line of dislikes across hours of the day. Let's check the trend line for the log data.
line_plot([12, 5], df_videos, 'hour', 'dislikes', 45,
'Trend Line of dislikes over hour of the day(log data)', 'normal_log')
Comment:
The above plot shows the trend line for the log-transformed data; it fluctuates much more than the untransformed trend line.
line_plot([12, 5], df_videos, 'hour', 'comment_count', 45,
'Trend Line of comment_count over hour of the day', 'normal')
Comment:
The above plot shows the trend line of comment counts across hours of the day. Let's check the trend line for the log data.
line_plot([12, 5], df_videos, 'hour', 'comment_count', 45,
'Trend Line of comment_count over hour of the day(log data)', 'normal_log')
Comment:
The above plot shows the trend line for the log-transformed data; it fluctuates much more than the untransformed trend line.
Multivariate exploration of data
Multivariate exploration involves analyzing more than two variables at a time, whether quantitative or qualitative. It helps us determine how multiple variables depend on each other.
Let's check how all the variables relate to each other with the help of a heat map.
#figsize
plt.figure(figsize = [9,7])
#plot
sns.heatmap(df_videos[['likes','views','dislikes','comment_count']].corr(), annot = True,cmap = 'viridis_r');
Comment:
A heatmap gives us a quick overview of the correlations between all the variables in a dataset. From the above plot we can see that the highest correlation is between views and likes, whereas the lowest is between likes and dislikes.
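As a cross-check, the strongest and weakest pairs can also be extracted from the correlation matrix programmatically; a sketch on a toy frame (on the report's data the matrix would come from df_videos[['likes','views','dislikes','comment_count']].corr()):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_videos: pull the most and least correlated
# variable pairs out of the correlation matrix instead of reading the heatmap.
df = pd.DataFrame({
    'views':    [100, 200, 300, 400],
    'likes':    [10, 21, 29, 41],
    'dislikes': [5, 1, 6, 2],
})
corr = df.corr()
# Keep one entry per pair: the strictly upper triangle, stacked to a Series.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
print(pairs.idxmax())  # most strongly correlated pair
print(pairs.idxmin())  # least correlated pair
```

Stacking the strictly upper triangle avoids both the diagonal of 1s and each pair being counted twice.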
Let us now create subset dataframes with the trending videos published in the top 3 categories and by the top 3 channels.
#top 3 categories dataframe
df_top3_cat = df_videos[df_videos['category_name'].isin(['Entertainment','Music','Howto & Style'])]
#top 3 channels dataframe
df_top3_chn = df_videos[df_videos['channel_title'].isin(['ESPN', 'TheEllenShow', 'The Tonight Show Starring Jimmy Fallon'])]
Let's write a function for a multivariate line plot.
def multi_line(figsize, data, x, y, z, angle, title, typee):
    if typee == 'normal':
        plt.figure(figsize = (figsize[0], figsize[1]))
        #plot, with one line per level of z
        sns.lineplot(y = y, x = x, hue = z, data = data)
        #set axis labels and title
        plt.xticks(rotation=angle)
        plt.ylabel(y.upper())
        plt.xlabel(x.upper())
        plt.title(title);
    elif typee == 'normal_log':
        plt.figure(figsize = (figsize[0], figsize[1]))
        #plot the log-transformed variable
        y1 = np.log(data[y] + 1)
        sns.lineplot(y = y1, x = x, hue = z, data = data)
        #set axis labels and title
        plt.xticks(rotation=angle)
        plt.ylabel(y.upper())
        plt.xlabel(x.upper())
        plt.title(title);
    else:
        print('please check typee')
Let's plot the trend lines of views, likes, dislikes and comment counts for the top 3 categories over months, days and hours of the day using df_top3_cat.
multi_line([12, 5], df_top3_cat, 'month', 'views', 'category_name', 45,
'Trend Line of views over categories and months', 'normal')
multi_line([12, 5], df_top3_cat, 'month', 'views','category_name', 45,
'Trend Line of views over categories and months(log data)', 'normal_log')
multi_line([12, 5], df_top3_cat, 'month', 'likes','category_name', 45,
'Trend Line of likes over categories and months', 'normal')
multi_line([12, 5], df_top3_cat, 'month', 'likes','category_name', 45,
'Trend Line of likes over categories and months(log data)', 'normal_log')
multi_line([12, 5], df_top3_cat, 'month', 'dislikes', 'category_name', 45,
'Trend Line of dislikes over categories and months', 'normal')
multi_line([12, 5], df_top3_cat, 'month', 'dislikes', 'category_name', 45,
'Trend Line of dislikes over categories and months(log data)', 'normal_log')
multi_line([12, 5], df_top3_cat, 'month', 'comment_count','category_name', 45,
'Trend Line of comment_count over categories and months', 'normal')
multi_line([12, 5], df_top3_cat, 'month', 'comment_count','category_name', 45,
'Trend Line of comment_count over categories and months(log data)', 'normal_log')
Comment:
The trend lines of views, likes, dislikes and comment counts over months are shown in the above plots. They are explained in detail in the exploratory analysis.
multi_line([12, 5], df_top3_cat, 'day', 'views', 'category_name', 45,
'Trend Line of views over categories and days', 'normal')
Comment:
The above plot shows us the trend line of views for top 3 categories over days
multi_line([12, 5], df_top3_cat, 'day', 'views','category_name', 45,
'Trend Line of views over categories and days(log data)', 'normal_log')
Comment:
The above plot shows us the trend line of views for top 3 categories over days for log transformed data
multi_line([12, 5], df_top3_cat, 'day', 'likes','category_name', 45,
'Trend Line of likes over categories and days', 'normal')
Comment:
The above plot shows us the trend line of likes for top 3 categories over days
multi_line([12, 5], df_top3_cat, 'day', 'likes','category_name', 45,
'Trend Line of likes over categories and days(log data)', 'normal_log')
Comment:
The above plot shows us the trend line of likes for top 3 categories over days for log transformed data
multi_line([12, 5], df_top3_cat, 'day', 'dislikes', 'category_name', 45,
'Trend Line of dislikes over categories and days', 'normal')
Comment:
The above plot shows us the trend line of dislikes for top 3 categories over days
multi_line([12, 5], df_top3_cat, 'day', 'dislikes', 'category_name', 45,
'Trend Line of dislikes over categories and days(log data)', 'normal_log')
Comment:
The above plot shows us the trend line of dislikes for top 3 categories over days for log transformed data
multi_line([12, 5], df_top3_cat, 'day', 'comment_count','category_name', 45,
'Trend Line of comment_count over categories and days', 'normal')
Comment:
The above plot shows us the trend line of comment count for top 3 categories over days
multi_line([12, 5], df_top3_cat, 'day', 'comment_count','category_name', 45,
'Trend Line of comment_count over categories and days(log data)', 'normal_log')
Comment:
The above plot shows us the trend line of comment count for top 3 categories over days for log transformed data
multi_line([12, 5], df_top3_cat, 'hour', 'views', 'category_name', 45,
'Trend Line of views over categories and hour of the day', 'normal')
Comment:
The above plot shows us the trend line of views for top 3 categories over hours
multi_line([12, 5], df_top3_cat, 'hour', 'views','category_name', 45,
'Trend Line of views over categories and hour of the day(log data)', 'normal_log')
Comment:
The above plot shows us the trend line of views for top 3 categories over hours for log transformed data
multi_line([12, 5], df_top3_cat, 'hour', 'likes','category_name', 45,
'Trend Line of likes over categories and hour of the day', 'normal')
Comment:
The above plot shows us the trend line of likes for top 3 categories over hours
multi_line([12, 5], df_top3_cat, 'hour', 'likes','category_name', 45,
'Trend Line of likes over categories and hour of the day(log data)', 'normal_log')
Comment:
The above plot shows us the trend line of likes for top 3 categories over hours for log transformed data
multi_line([12, 5], df_top3_cat, 'hour', 'dislikes', 'category_name', 45,
'Trend Line of dislikes over categories and hour of the day', 'normal')
Comment:
The above plot shows us the trend line of dislikes for top 3 categories over hours
multi_line([12, 5], df_top3_cat, 'hour', 'dislikes', 'category_name', 45,
'Trend Line of dislikes over categories and hour of the day(log data)', 'normal_log')
Comment:
The above plot shows us the trend line of dislikes for top 3 categories over hours for log transformed data
multi_line([12, 5], df_top3_cat, 'hour', 'comment_count','category_name', 45,
'Trend Line of comment_count over categories and hour of the day', 'normal')
Comment:
The above plot shows us the trend line of comment count for top 3 categories over hours
multi_line([12, 5], df_top3_cat, 'hour', 'comment_count','category_name', 45,
'Trend Line of comment_count over categories and hour of the day(log data)', 'normal_log')
Comment:
The above plot shows us the trend line of comment count for top 3 categories over hours for log transformed data
Let's write a function for multivariate scatter plots.
def multi_plot(figsize, data, x, y, z, title, typee):
    if typee == 'hue':
        #figsize
        plt.figure(figsize = (figsize[0], figsize[1]))
        #plot the data, encoding z with hue and marker style
        sns.scatterplot(y = y, x = x, data = data,
                        hue = z, style = z)
        #set axis labels and title
        plt.ylabel(y.upper())
        plt.xlabel(x.upper())
        plt.title(title);
    elif typee == 'hue_log':
        plt.figure(figsize = (figsize[0], figsize[1]))
        sns.scatterplot(y = y, x = x, data = data,
                        hue = z, style = z)
        #view the data on log scale axes
        plt.xscale('log')
        plt.yscale('log')
        plt.ylabel(y.upper())
        plt.xlabel(x.upper())
        plt.title(title);
    elif typee == 'hue_log_data':
        plt.figure(figsize = (figsize[0], figsize[1]))
        #plot the log-transformed data; log(x + 1) keeps zero counts defined
        x1 = np.log(data[x] + 1)
        y1 = np.log(data[y] + 1)
        sns.scatterplot(y = y1, x = x1, data = data,
                        hue = z, style = z)
        plt.ylabel(y.upper())
        plt.xlabel(x.upper())
        plt.title(title);
    elif typee == 'size':
        plt.figure(figsize = (figsize[0], figsize[1]))
        #plot the data, encoding z with hue and marker size
        sns.scatterplot(y = y, x = x, data = data,
                        hue = z, size = z)
        plt.ylabel(y.upper())
        plt.xlabel(x.upper())
        plt.title(title);
    elif typee == 'size_log':
        plt.figure(figsize = (figsize[0], figsize[1]))
        sns.scatterplot(y = y, x = x, data = data,
                        hue = z, size = z)
        #view the data on log scale axes
        plt.xscale('log')
        plt.yscale('log')
        plt.ylabel(y.upper())
        plt.xlabel(x.upper())
        plt.title(title);
    elif typee == 'size_log_data':
        plt.figure(figsize = (figsize[0], figsize[1]))
        #plot the log-transformed data
        x1 = np.log(data[x] + 1)
        y1 = np.log(data[y] + 1)
        sns.scatterplot(y = y1, x = x1, data = data,
                        hue = z, size = z)
        plt.ylabel(y.upper())
        plt.xlabel(x.upper())
        plt.title(title);
    else:
        print('please check typee')
Now let's look for correlations between the variables over categories and channels using scatter plots for the top 3 categories and the top 3 channels.
multi_plot([12,7], df_top3_cat, 'views', 'likes', 'category_name',
'Distribution of likes over views and categories', 'hue')
Comment:
The above scatter plot helps us find correlations between variables. Here we can see that views and likes have a positive correlation, but as the data has a large scale issue we will have to cross-check the relation. Let's check the correlation in the log-transformed data.
multi_plot([12,7], df_top3_cat, 'views', 'likes', 'category_name',
'Distribution of likes over views and categories(log scale)', 'hue_log')
Comment:
The log-scale plot doesn't show much of the data because of scale issues. Let's check the log-transformed data.
multi_plot([12,7], df_top3_cat, 'views', 'likes', 'category_name',
'Distribution of likes over views and categories', 'hue_log_data')
Comment:
From the plot of the log-transformed data we can see that the correlation between views and likes is positive and strong.
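The visual impression can be backed up numerically with Pearson's r on the log-transformed columns; a sketch with toy values standing in for views and likes:

```python
import numpy as np
import pandas as pd

# Toy values standing in for df_top3_cat's views and likes: compute
# Pearson's r on the log(x + 1) transformed columns rather than eyeballing
# the scatter plot.
views = pd.Series([1_000, 10_000, 100_000, 1_000_000])
likes = pd.Series([40, 450, 3_900, 41_000])
r = np.log1p(views).corr(np.log1p(likes))
print(round(r, 3))
```

On the real data, `np.log1p(df_top3_cat['views']).corr(np.log1p(df_top3_cat['likes']))` would put a number on the "positive and strong" claim.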
Let's check the correlation between views and dislikes.
multi_plot([12,7], df_top3_cat, 'views', 'dislikes', 'category_name',
'Distribution of dislikes over views and categories', 'hue')
Comment:
The above scatter plot helps us find correlations between variables. Here views and dislikes don't seem to have any correlation, but as the data has a large scale issue we will have to cross-check the relation. Let's check the correlation in the log-transformed data.
multi_plot([12,7], df_top3_cat, 'views', 'dislikes', 'category_name',
'Distribution of dislikes over views and categories(log data)', 'hue_log_data')
Comment:
From the plot of the log-transformed data we can see that the correlation between views and dislikes is positive.
Let's check the correlation between views and comment counts.
multi_plot([12,7], df_top3_cat, 'views', 'comment_count', 'category_name',
'Distribution of comment_count over views and categories', 'hue')
Comment:
The above scatter plot helps us find correlations between variables. Here views and comment counts seem to have a positive but very weak correlation. As the data has a large scale issue we will have to cross-check the relation. Let's check the correlation in the log-transformed data.
multi_plot([12,7], df_top3_cat, 'views', 'comment_count', 'category_name',
'Distribution of comment_count over views and categories(log data)', 'hue_log_data')
Comment:
From the plot of the log-transformed data we can see that the correlation between views and comment counts is positive.
Let's check the correlation between views and likes for the top 3 channels.
multi_plot([12,7], df_top3_chn, 'views', 'likes', 'channel_title',
'Distribution of likes over views and channels', 'hue')
Comment:
The above scatter plot helps us find correlations between variables. Here we can see that views and likes have a positive correlation, but as the data has a large scale issue we will have to cross-check the relation. Let's check the correlation in the log-transformed data.
multi_plot([12,7], df_top3_chn, 'views', 'likes', 'channel_title',
'Distribution of likes over views and channels(log data)', 'hue_log_data')
Comment:
From the plot of the log-transformed data we can see that the correlation between views and likes is positive and strong.
Let's check the correlation between views and dislikes.
multi_plot([12,7], df_top3_chn, 'views', 'dislikes', 'channel_title',
'Distribution of dislikes over views and channels', 'hue')
Comment:
The above scatter plot helps us find correlations between variables. Here views and dislikes don't show a clear correlation, but as the data has a large scale issue we will have to cross-check the relation. Let's check the correlation in the log-transformed data.
multi_plot([12,7], df_top3_chn, 'views', 'dislikes', 'channel_title',
'Distribution of dislikes over views and channels(log data)', 'hue_log_data')
Comment:
From the plot of the log-transformed data we can see that the correlation between views and dislikes is positive and strong.
Let's check the correlation between views and comment counts.
multi_plot([12,7], df_top3_chn, 'views', 'comment_count', 'channel_title',
'Distribution of comment_count over views and channels', 'hue')
Comment:
The above scatter plot helps us find correlations between variables. Here views and comment counts have a positive but weak correlation, and as the data has a large scale issue we will have to cross-check the relation. Let's check the correlation in the log-transformed data.
multi_plot([12,7], df_top3_chn, 'views', 'comment_count', 'channel_title',
'Distribution of comment_count over views and channels(log data)', 'hue_log_data')
Comment:
From the plot of the log-transformed data we can see that the correlation between views and comment counts is positive but weak.
Let's check the correlation between likes, dislikes and comment counts for all the data.
multi_plot([12,7], df_videos, 'likes', 'dislikes', 'comment_count',
'Distribution of likes, dislikes and comments', 'size')
Comment:
From the above plot it is hard to get an insight into how the variables are correlated because of the large scale issue. Let's plot the data on a log scale and with log-transformed data to see whether we get a better insight.
multi_plot([12,7], df_videos, 'likes', 'dislikes', 'comment_count',
'Distribution of likes, dislikes and comments(log scale)', 'size_log')
Comment:
The log-scale plot doesn't show any data because of scale issues. Let's check the log-transformed data.
multi_plot([12,7], df_videos, 'likes', 'dislikes', 'comment_count',
'Distribution of likes, dislikes and comments(log data)', 'size_log_data')
Comment:
From the plot of the log-transformed data we can see that the correlation between likes, dislikes and comment counts is positive and strong, though the relation is weaker at the minimum values of the data.
# save the cleaned dataset to a csv file
# note: to_csv returns None, so we must not reassign df_videos to its result
df_videos.to_csv('us_videos_cleaned.csv', index = False)
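A quick round-trip sketch (toy frame, illustrative path) showing why the result of to_csv should not be assigned back to the dataframe:

```python
import os
import tempfile

import pandas as pd

# Toy frame and temp path, both illustrative: to_csv returns None, so the
# frame itself must not be reassigned to its result; reading the file back
# reproduces the data.
df = pd.DataFrame({'views': [1, 2], 'likes': [3, 4]})
path = os.path.join(tempfile.mkdtemp(), 'us_videos_cleaned.csv')
df.to_csv(path, index=False)   # returns None; df is left untouched
restored = pd.read_csv(path)
print(restored.equals(df))
```

Writing `df = df.to_csv(...)` would silently replace the dataframe with None and break any later use of it.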
In this report we worked on the YouTube trending videos dataset from Kaggle, accessing and analyzing only the US trending videos from November 2017 to July 2018. First we assessed the data for quality and tidiness issues, then we cleaned the identified issues. After cleaning we saved the dataset to a csv file for visual analysis and exploration.
The data has many qualitative and quantitative variables, and we also derived new variables to produce better, more readable plots. In this report we built univariate, bivariate and multivariate plots to explore many relationships in the dataset.
From all the plots we learned that likes and comments have a positive relation to the other variables, while dislikes sometimes show a negative relation and sometimes no clear relation at all. We also learned that the Entertainment category and the ESPN channel published the most trending videos, whereas the Music category has the most views, likes and comments (explained in the explanatory analysis file and slides).
LIMITATIONS:
The insights drawn from the analysis and visualization are based purely on the given data. The major limitations of the data are its large scale, overplotting and outliers, which made plots unreadable; we applied log transformations and log scales to make them readable. Another limitation is that the analysis covers only the US zone and may not apply to the other zones provided in the dataset.
SOURCES